A Study of Thyroid Fine Needle Aspiration of Follicular Adenoma in the “Atypia of Undetermined Significance” Bethesda Category Using Digital Image Analysis

Background Originally designed for computerized image analysis, ThinPrep is underutilized in that role outside gynecological cytology. It can be used to address the inter- and intra-observer variability in the evaluation of thyroid fine needle aspiration (TFNA) biopsy and to help pathologists gain additional insight into thyroid cytomorphology. Methods We designed and validated a digital image analysis method based on feature engineering and supervised machine learning, using ImageJ and Python scikit-learn. The method was trained and validated on 400 low-power (100x) and 400 high-power (400x) images generated from 40 TFNA cases. Result The area under the curve (AUC) of the receiver operating characteristic (ROC) was 0.75 (0.74–0.82) for the model based on low-power images and 0.74 (0.69–0.79) for the model based on high-power images. Cytomorphologic features were synthesized using feature engineering; when evaluated in isolation, they achieved AUCs of 0.71 (0.64–0.77) for chromatin, 0.70 (0.64–0.73) for cellularity, 0.65 (0.60–0.69) for cytoarchitecture, 0.57 (0.51–0.61) for nuclear size, and 0.63 (0.57–0.68) for nuclear shape. Conclusion Our study proves that ThinPrep is an excellent preparation method for digital image analysis of thyroid cytomorphology. It can be used to quantitatively harvest morphologic information for diagnostic purposes.


Introduction
Because the thyroid is one of the more accessible organs for fine needle aspiration (FNA) biopsy, thyroid nodules are frequently evaluated by cytology to decide between surgical and conservative management. While a subset of thyroid FNAs (T-FNAs) contains clear cytomorphologic features of neoplastic lesions that cytopathologists can diagnose definitively and reliably, up to 21% of cases within some institutions display cellular and architectural atypia insufficient for definitive diagnosis, leaving clinicians with significant uncertainty about the appropriate management. 1,2 Many indeterminate results due to architectural atypia identified within T-FNAs are reported by pathologists as "atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS)" (Bethesda category III) in the Bethesda System for Reporting Thyroid Cytopathology (TBSRTC). While TBSRTC recommends molecular assays for both categories to guide management, 3 many clinicians are seeking lower-cost options to enhance the diagnostic accuracy of the existing cytological material, particularly in the indeterminate diagnostic categories.
In our current study, we evaluated an alternative pathway to an objective, reproducible diagnosis by utilizing an existing cytologic preparation technique optimized for digital pathology and machine learning algorithms. 4,5 This technology can provide an alternative way to resolve indeterminate diagnostic categories through digital evaluation and classification of cytomorphologic features (follicular group architecture, smear cellularity, amount of colloid, and cytologic atypia) associated with follicular neoplasms. 6 To the best of our knowledge, ThinPrep® is underutilized in this regard, although it is widely used by many institutions for the evaluation of thyroid aspirate material. ThinPrep® is conveniently primed for digital image analysis (DIA), as it was created to reduce the variability of stains and was originally developed for the ThinPrep Imaging System. 7 In this study, we aim to evaluate the feasibility of applying DIA to T-FNA material prepared by the ThinPrep® procedure and to use it to gain insight that improves the diagnostic accuracy of thyroid aspiration cytology.

Case Collection and Image Capture
To reduce the complexity of the study, we focused on the morphologic differences between surgically verified benign thyroid nodules and follicular adenoma. The morphologic criteria for these lesions are more subjective in extent and degree, and therefore yield less reproducible diagnoses, than those for thyroid lesions with cytologic (nuclear) atypia such as papillary thyroid lesions.
From our laboratory information system (LIS), we performed a structured query language (SQL) search for all surgical resection cases diagnosed as follicular adenoma or thyroid with nodular hyperplasia. Cross-referencing the prior T-FNAs, we identified 20 T-FNAs diagnosed as AUS/FLUS with subsequent diagnoses of follicular adenoma on surgical resection and 20 T-FNAs with subsequent diagnoses of benign thyroid nodules on surgical resection. For each case, digital images of 10 mid-power (100x) and 10 high-power (400x) fields on the ThinPrep material were obtained using a DP71 camera (Olympus, 3500 Corporate Parkway, Center Valley, PA 18034, USA) on an Olympus BX51 microscope with CellSens Entry v1.12 (Olympus, USA). The mid-power fields were taken at random to evaluate overall specimen cellularity, while the high-power fields captured follicular cells. All images associated with each case were grouped together and further reviewed by a board-certified cytopathologist (ML) to evaluate for adequate cellularity and to render a diagnosis within the Bethesda classification system. Unsatisfactory cases with insufficient cellularity were removed from the study. In total, we curated 800 images through the above process.

Image Analysis
To maximize the use of the images, a custom image analysis algorithm was developed based on cytomorphology feature engineering and supervised machine learning.

Cytomorphology Feature Engineering
We used ImageJ v1.51p (NIH, USA) to develop cytomorphology feature engineering. The process consists of image segmentation followed by feature extraction (Fig. 1). For image segmentation, we started by preprocessing the images: subtracting the background, followed by red-green-blue color channel separation. We extracted only the green channel and created masks for all nuclei using an automatic threshold method. The feature extraction process focused on the nuclei, which were treated as individual "particles" with low-level features. The low-level features were selectively grouped together based on the authors' cytomorphology knowledge to form medium- and high-level features (Table 1). For example, a medium-level feature, nuclear size, or simply size, is composed of the mean and standard deviation of nuclear area, which are low-level features. Cytology, a high-level feature, is composed of three medium-level features: chromatin, shape, and size. For the high-power images, the "particles" were filtered by low-level features such as size and circularity to remove background noise. These low-level features were also used to distinguish or "gate" individual nuclei from closely grouped clusters to detect crowding of follicular cells. The "cellularity" high-level features were extracted only from the mid-power images. Altogether, we used a total of 86 low-level nuclear features to construct three medium-level feature models (chromatin, shape, and size), three high-level feature models (cellularity, architecture, and cytology), and two models based on magnification (low and high power).
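The segmentation and gating steps described above can be sketched in Python with NumPy. This is a minimal illustration, not the study's ImageJ macro: Otsu's method stands in for the unspecified automatic threshold, background subtraction is omitted, and the function names, gate limits, and synthetic test image are all assumptions made for the example. Circularity is computed with the usual 4π·area/perimeter² definition.

```python
import math
from collections import deque

import numpy as np


def otsu_threshold(gray):
    """Otsu threshold of an 8-bit image (assumed stand-in for ImageJ's auto threshold)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                      # pixel count at or below each level
    w1 = hist.sum() - w0                      # pixel count above each level
    sum0 = np.cumsum(hist * levels)
    mu0 = np.divide(sum0, w0, out=np.zeros(256), where=w0 > 0)
    mu1 = np.divide(sum0[-1] - sum0, w1, out=np.zeros(256), where=w1 > 0)
    return int(np.argmax(w0 * w1 * (mu0 - mu1) ** 2))


def particle_areas(mask):
    """Pixel areas of 4-connected particles in a boolean mask."""
    seen = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    areas = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        seen[r, c] = True
        queue, area = deque([(r, c)]), 0
        while queue:
            y, x = queue.popleft()
            area += 1
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < height and 0 <= nx < width and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        areas.append(area)
    return areas


def circularity(area, perimeter):
    """4*pi*area / perimeter^2: 1.0 for a perfect circle, lower for irregular shapes."""
    return 4.0 * math.pi * area / perimeter ** 2


def nuclear_size_features(rgb, min_area, max_area):
    """Green channel -> dark-object mask -> size gate -> mean/SD of nuclear area."""
    green = rgb[..., 1]
    mask = green <= otsu_threshold(green)     # nuclei are darker than the background
    gated = [a for a in particle_areas(mask) if min_area <= a <= max_area]
    if not gated:
        return {"n_nuclei": 0, "area_mean": 0.0, "area_std": 0.0}
    return {"n_nuclei": len(gated),
            "area_mean": float(np.mean(gated)),
            "area_std": float(np.std(gated))}


# Synthetic image (NOT study data): bright background, two "nuclei", one noise speck.
image = np.full((64, 64, 3), 220, dtype=np.uint8)
image[5:10, 5:10] = 40      # 5 x 5 nucleus, area 25
image[30:36, 30:36] = 40    # 6 x 6 nucleus, area 36
image[50, 50] = 40          # single-pixel speck, removed by the size gate
features = nuclear_size_features(image, min_area=5, max_area=100)
```

The mean and standard deviation of the gated areas correspond to the "nuclear size" medium-level feature; the other low-level features would be added to the returned dictionary in the same way.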

Figure 1. The segmentation and feature extraction of each image (A) starts with background subtraction (B), followed by conversion to 8-bit grayscale image (green channel only) through color deconvolution (C), automatic threshold segmentation, and finally a mask (red) for the nuclear features (D). The extracted nuclei features are further "gated" (high-power only) using size and circularity to separate out individual nuclei from closely grouped clusters.

Supervised Machine Learning
Supervised machine learning methods aim to automatically create algorithms based on known paired input (e.g., features) and expected output (e.g., ground truth) data. Training data are used to optimize the weights and parameters of the algorithms, while the validation data are used to validate the generalizability and performance of the trained algorithm.
Following the above principles, the various combinations of features defined by the models were used as the input, and the surgical diagnosis (follicular adenoma vs. benign thyroid) corresponding to each T-FNA was the expected output. Follicular adenoma was considered a positive result. Using the Python scikit-learn (sklearn) library, we used the gradient boosting classifier (GBC) and the extra-trees classifier (ETC) as our supervised machine learning methods. The training and validation data were randomly split 1:1 from the collected data using a data splitting algorithm. The process was repeated three times to further ensure generalization and to prevent overfitting. We also used the extra-trees classifier to evaluate the importance of low-level features using all available data.
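A minimal sketch of this training and validation scheme with scikit-learn is given below. It assumes default hyperparameters, uses the split index as the random seed, and runs on synthetic stand-in data with eight made-up features; none of these details come from the study, and `evaluate_models` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_models(X, y, n_repeats=3):
    """Validation AUC of GBC and ETC over repeated random 1:1 splits."""
    aucs = {"GBC": [], "ETC": []}
    for seed in range(n_repeats):
        # test_size=0.5 gives the 1:1 training/validation split
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.5, random_state=seed
        )
        models = {
            "GBC": GradientBoostingClassifier(random_state=seed),
            "ETC": ExtraTreesClassifier(random_state=seed),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            # probability of the positive class (1 = follicular adenoma)
            prob_positive = model.predict_proba(X_val)[:, 1]
            aucs[name].append(roc_auc_score(y_val, prob_positive))
    return aucs


# Synthetic stand-in data (NOT study data): only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

aucs = evaluate_models(X, y)

# Feature importance from an extra-trees model fitted on all available data.
importances = ExtraTreesClassifier(random_state=0).fit(X, y).feature_importances_
```

Restricting the columns of `X` to one feature group (for example, only chromatin features) before calling `evaluate_models` would give the per-group models described in the text.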

Result
The performance of a predictive test is measured by accuracy, the closeness of the measurements to a specific value; precision, also known as positive predictive value; and recall, also known as sensitivity. Since all features were used in both the high- and low-power models, their performance directly measures the DIA algorithm design. Using validation data only, the mid-power model achieved an average accuracy of 0.71 (0.  (Fig. 2). Table 1 gives additional details on the breakdown of prediction accuracy contribution and statistical analyses of all features.
Since the high- and mid-power magnification models perform reasonably based on the validation results, the high- and medium-level feature models can be treated as statistical hypothesis tests to evaluate the importance of each group of features and their contribution to the accuracy of the models. Based on this method, all three high-level features (cellularity, architecture, and cytology) appear to contribute significantly. For the medium-level features, nuclear chromatin appears to be the strongest contributor, while nuclear shape is a distant second. Nuclear size, on the other hand, appears to be non-contributory, a finding corroborated by statistical analysis (P=0.10) of the size variation between T-FNAs from follicular adenoma and benign thyroid (Table 1).
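For reference, the three metrics can be computed from the confusion counts as in the short sketch below, assuming the usual binary-classification definitions with follicular adenoma coded as the positive class (1). The labels in the example are made up for illustration and are not study results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision (PPV), and recall (sensitivity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels only: 4 adenomas (1) and 6 benign nodules (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here three of four adenomas are detected and one benign nodule is called positive, so accuracy is 8/10 while precision and recall are both 3/4.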
The minimal presence of colloid material in ThinPrep combined with technical limitations prevented the incorporation of these morphologic features into our models.

Discussion
The current evaluation of T-FNA relies on manual visual evaluation by cytopathologists. It is known that while the human visual system is excellent at recognizing patterns, it performs poorly on quantitative tasks and is susceptible to optical illusions. 8 Most suspicious or malignant (Bethesda category IV to VI) T-FNA cases show higher rates of diagnostic reproducibility among cytopathologists as they present with more pronounced architectural and cytologic features. In these cases, there is little need for repeated T-FNA or ancillary molecular tests for further characterization as the evidence for surgical management is well established. However, for Bethesda category III, the degree of cytologic and architectural atypia may be subtle and variable, which explains the high degree of inter-observer variability. 9 Furthermore, the Bethesda criteria for this diagnostic category, whether architectural or cytologic atypia, are not defined in quantifiable methods and therefore are fundamentally subjective.
Our study shows that, while looking at the exact same set of images, board-certified cytopathologists may err on the side of caution and sacrifice overall accuracy. While molecular testing provides an alternative to repeat T-FNA, it comes at the cost of additional needle passes and of the assay itself. 10 Additionally, the exact predictive performance of molecular tests for entities like follicular adenoma remains controversial based on existing published data. 11 Our results show that routine T-FNA augmented by DIA using ThinPrep material can produce predictions from the pre-existing diagnostic material with increased overall accuracy by quantitatively evaluating morphologic features. Therefore, concurrent evaluation of preliminarily indeterminate T-FNAs with DIA may be a more cost-effective method for evaluating thyroid nodules without additional biopsies or molecular studies. Additionally, as a liquid-based cytology preparation that uses standardized instruments to produce monolayers of well-stained and well-preserved cells, ThinPrep may be explored for other non-gynecologic image analysis applications.
Our DIA design also examines the morphologic differences between T-FNAs from follicular adenoma and benign thyroid. The performance of the high-level feature models shows that cellularity, architecture, and cytology all appear to contribute to the accuracy of the models (Table 1 and Fig. 2). ThinPrep material from follicular adenoma has a higher degree of cellularity, greater follicular cell crowding, and quantifiable nuclear differences compared with benign thyroid (Table 1 and Fig. 2). Further characterization of the nuclear morphology profiles using the medium-level features shows that nuclear chromatin appears to be the strongest contributor to accuracy, while nuclear shape is a distant second. Nuclear size was not a discriminating feature (with an AUC close to 0.5), and this finding is further supported by Student's t-test (P > 0.05) for nuclear size (Table 1).
Characterization of the nuclear chromatin profile and shape differences beyond the listed performance and statistical metrics is suboptimal because of the limited sample size and technical constraints (Table 1). However, the above findings reaffirm that cellularity, chromatin texture, and architectural features are diagnostically important in ThinPrep-based T-FNA for follicular adenoma.
To the best of our knowledge, this is the first attempt to apply DIA simultaneously to build predictive models that better separate indeterminate thyroid diagnostic categories (Bethesda III) and to investigate T-FNA cytomorphology in ThinPrep material. While T-FNA cytomorphology is well studied on manually made smears, the decades of use of computer-assisted image analysis in gynecologic cytology (e.g., ThinPrep Imaging System) and the recent advances in digital image analysis merit a second look at expanded applications for liquid-based preparations such as ThinPrep. 12 Limitations of our current study include the low number of cases in the dataset and the comparison of DIA against a single cytopathologist. The scope of the DIA algorithm is currently limited to T-FNA of follicular adenoma or benign thyroid nodule with ThinPrep material. A whole slide imaging method was not used due to limited development time. We do believe that the mid- and high-power models can capture the vast majority of the morphologic features, and thus this study can serve as a proof of concept and pave the way for more advanced future studies to build DIA-based decision-support tools for T-FNA.